Wind Disaster Risk Analysis in Sleman Regency¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium
import plotly.express as px
import plotly.graph_objects as go
import branca.colormap as cm

Project Overview¶

This project explores wind disaster risk across districts in Sleman Regency using clustering techniques. The analysis aims to identify high-risk districts, detect unusual patterns, and provide actionable insights for disaster mitigation.

Objectives:

  • Identify high-risk districts using K-Means clustering.
  • Detect annual trends and outlier districts using DBSCAN clustering.
  • Highlight key risk factors such as fallen trees, power network disruptions, and frequency of incidents.

Data Loading & Cleaning¶

We start by loading the dataset obtained from Badan Penanggulangan Bencana Daerah (the Regional Disaster Management Agency). The raw Excel data was cleaned, and key features were extracted for analysis.

Load dataset¶

In [2]:
sleman = pd.read_csv("DATA FIX.csv")

These are the columns of the dataset:

In [3]:
sleman.keys()
Out[3]:
Index(['Kecamatan', 'Tahun', 'BanyakKejadian ', 'PohonTumbang ',
       'JaringanListrik '],
      dtype='object')

Remove extra spaces from column names to ensure consistency.

In [4]:
sleman.columns = sleman.columns.str.strip()

The dataset has 34 entries and 5 columns:

In [5]:
sleman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Kecamatan        34 non-null     object
 1   Tahun            34 non-null     int64 
 2   BanyakKejadian   34 non-null     int64 
 3   PohonTumbang     34 non-null     int64 
 4   JaringanListrik  34 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.5+ KB
In [6]:
print(sleman.isnull().sum())
Kecamatan          0
Tahun              0
BanyakKejadian     0
PohonTumbang       0
JaringanListrik    0
dtype: int64

The dataset was cleaned and verified with no missing values.

Load shapefile¶

The Kecamatan (district) shapefile was imported and reprojected to the standard geographic CRS (EPSG:4326, WGS 84).

In [7]:
kecamatan = gpd.read_file("shp sleman.shp").to_crs(epsg=4326)

Exploratory Data Analysis¶

We focus on key features representing the impact of wind disasters:

In [8]:
features = ['BanyakKejadian','PohonTumbang','JaringanListrik']

Distribution of features¶

We visualize the distribution of disaster-related features across districts to identify extreme values (hotspots).

In [9]:
# Convert Data
df_melt = sleman.melt(value_vars=features, var_name='Feature', value_name='Value')

# Create boxplot
fig = px.box(df_melt, x='Feature', y='Value', color='Feature', title="Feature Distributions")
fig.update_layout(showlegend=False)
fig.show()

Number of incidents and fallen trees show high outliers, indicating specific districts face very high risk. These are the primary wind disaster hotspots.
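The hotspot reading from the boxplot can also be checked programmatically with the standard 1.5×IQR fence. The sketch below uses a small synthetic frame, since `DATA FIX.csv` is not reproduced here; the district names and counts are illustrative only:

```python
import pandas as pd

# Synthetic stand-in for the notebook's dataframe (illustrative values only)
df = pd.DataFrame({
    "Kecamatan": ["A", "B", "C", "D", "E"],
    "BanyakKejadian": [2, 3, 4, 3, 25],   # one extreme district
})

col = "BanyakKejadian"
q1, q3 = df[col].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)              # standard 1.5*IQR upper fence
hotspots = df.loc[df[col] > upper, "Kecamatan"].tolist()
print(hotspots)  # → ['E']
```

The same rule applied per feature on the real data would list exactly which districts drive the outlier points visible in the boxplot.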

Feature correlation¶

Before clustering, it's useful to understand how the features relate to each other.

In [10]:
plt.figure(figsize=(8,6))
sns.heatmap(sleman[features].corr(), annot=True, cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap"); plt.show()

From the heatmap, we can see that fallen trees strongly correlate with power network disruptions, highlighting them as a key factor amplifying disaster impact. Total incidents show moderate correlation with the other features.

Clustering Analysis¶

Feature scaling¶

To prepare the data for clustering, we need all features to contribute equally. Since the features have different scales, we scale them to a 0-1 range.

In [11]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(sleman[features])

1. K-Means Clustering¶

In [12]:
from sklearn.cluster import KMeans

Determining the Optimal Number of Clusters¶

To identify the most suitable number of clusters, two complementary approaches were applied:

1. Elbow Method¶

The Elbow Method evaluates the Within-Cluster Sum of Squares (WSS) for different values of k. The "elbow" point indicates the optimal cluster count.

In [13]:
wss = [KMeans(n_clusters=i, random_state=42, n_init=10).fit(X_scaled).inertia_ for i in range(1,10)]

# Create plot
plt.plot(range(1,10), wss, marker='o')
plt.xlabel("k"); plt.ylabel("WSS"); plt.title("Elbow Method"); plt.show()

2. Silhouette Analysis¶

Silhouette Analysis measures how well each data point fits within its assigned cluster. A higher Silhouette Score indicates better-defined clusters.

In [14]:
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer

# Candidate values of k
k_values = [2, 3, 4, 5]

for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = model.fit_predict(X_scaled)
    
    # Create silhouette score
    score = silhouette_score(X_scaled, labels)
    
    # Create plot
    fig, ax = plt.subplots(figsize=(6,4))
    SilhouetteVisualizer(model, ax=ax).fit(X_scaled).finalize()
    plt.title(f"Silhouette k={k} (Score={score:.3f})", fontsize=11)
    plt.show()

Elbow and Silhouette analyses suggest 3 clusters as the optimal choice.
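The silhouette-based choice can also be made explicit by selecting the k with the highest score. A minimal sketch on synthetic, well-separated data standing in for `X_scaled` (which depends on the source CSV):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs as a stand-in for the scaled features
X = np.vstack([rng.normal(loc, 0.05, size=(20, 3)) for loc in (0.1, 0.5, 0.9)])

# Score each candidate k and keep the best
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 3 for these well-separated blobs
```

On the real `X_scaled`, the same loop reproduces the comparison done visually above and returns the k with the highest Silhouette Score.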

Applying K-Means¶

In [15]:
kmeans = KMeans(n_clusters=3, random_state=123, n_init=10)
sleman['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)

Assign Risk Levels¶

We calculate the centroid of each cluster and sum its feature values; the cluster with the highest total is labeled High risk.

In [16]:
# Recover centroids on the original feature scale
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
centroids['Total'] = centroids.sum(axis=1)

# Assign risk levels
risk_order = ['High','Moderate','Low']  # highest Total = High Risk
centroids_sorted = centroids.sort_values('Total', ascending=False).reset_index()
cluster_to_risk = {row['index']: risk_order[i] for i,row in centroids_sorted.iterrows()}
sleman['KMeans_Risk'] = sleman['KMeans_Cluster'].map(cluster_to_risk)

3D Scatter Plot K-Means¶

In [17]:
fig = px.scatter_3d(sleman, x='BanyakKejadian', y='PohonTumbang', z='JaringanListrik',
                    color='KMeans_Risk', symbol='KMeans_Risk', hover_data=['Kecamatan', 'Tahun'])

colors = {'High':'red','Moderate':'green','Low':'blue'}
for i,row in centroids_sorted.iterrows():
    fig.add_trace(go.Scatter3d(x=[row['BanyakKejadian']], y=[row['PohonTumbang']], z=[row['JaringanListrik']],
                               mode='markers', marker=dict(size=12, color=colors[risk_order[i]], symbol='x'),
                               name=f"{risk_order[i]} Centroid", showlegend=False))
fig.update_layout(title="K-Means Clusters 3D", legend_title_text='Risk Level')
fig.show()

Insight:

  • High Risk – "Red zone"
    Districts like Cangkringan stand out with high incident counts, severe tree falls, and frequent power outages, making them the top priority for mitigation.

  • Moderate Risk – "Green zone"
    Mid-level points with mixed patterns, e.g., high tree falls but low outages, signaling potential escalation if not addressed.

  • Low Risk – "Blue zone"
    Close to the origin with minimal events, these districts remain relatively safe but still require monitoring.

2. DBSCAN Clustering¶

In [18]:
from sklearn.cluster import DBSCAN

Determining eps¶

In [19]:
from sklearn.neighbors import NearestNeighbors

# Set parameter min_samples
min_samples = 3

# Compute distances to the nearest neighbors
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled).kneighbors(X_scaled)

# Plot K-Distance
plt.plot(np.sort(distances[:, min_samples-1]))
plt.title("K-Distance Plot")
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {min_samples}th nearest neighbor")
plt.show()

From the K-Distance Plot, the value 0.15 was chosen at the elbow point, where the slope of the curve increases sharply.
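Reading the elbow by eye can be supplemented with a simple heuristic: take the point where the sorted k-distance curve bends most sharply, i.e. where its second difference is largest. A sketch on a synthetic distance curve (the real curve depends on the dataset, so 0.12 below is illustrative, not the notebook's 0.15):

```python
import numpy as np

# Synthetic sorted k-distance curve: a flat region followed by a sharp rise
d = np.concatenate([np.linspace(0.02, 0.10, 30), np.linspace(0.12, 0.60, 5)])

# Heuristic: the elbow is where the curve bends most, i.e. where the
# second difference of the sorted distances is largest
second_diff = np.diff(d, n=2)
elbow_idx = int(np.argmax(second_diff)) + 1  # +1 maps back to the middle point of the triple
eps_candidate = d[elbow_idx]
print(round(float(eps_candidate), 3))  # → 0.12
```

Applied to the sorted distances from the real K-Distance Plot, this heuristic gives a reproducible eps candidate to compare against the visually chosen 0.15.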

Applying DBSCAN¶

In [20]:
eps_optimal = 0.15
sleman['DBSCAN_Cluster'] = DBSCAN(eps=eps_optimal, min_samples=min_samples).fit_predict(X_scaled)

After applying DBSCAN, we map each cluster label to a descriptive risk level.

In [21]:
risk_map_dbscan = {-1:'Noise', 0:'Low', 1:'Moderate', 2:'High'}
sleman['DBSCAN_Risk'] = sleman['DBSCAN_Cluster'].map(risk_map_dbscan)

DBSCAN Trend Analysis¶

We track DBSCAN cluster labels by year for each district to see movement across risk levels. To prepare the data, we first pivot the dataframe into a district-by-year table, then melt it back to long format for plotting.

In [22]:
# Create a pivot table 
trend_dbscan = sleman.pivot_table(index='Kecamatan', columns='Tahun', values='DBSCAN_Cluster').reset_index()

# Convert to long format
trend_long = trend_dbscan.melt(id_vars="Kecamatan", var_name="Year", value_name="Cluster")
trend_long['RiskLevel'] = trend_long['Cluster'].map(risk_map_dbscan)

# Create line chart
fig = px.line(trend_long, x='Year', y='Cluster', color='Kecamatan', markers=True, hover_data=['RiskLevel'], 
              labels={'Kecamatan':'District'})
fig.update_yaxes(tickvals=[-1, 0, 1, 2], title="Cluster (DBSCAN)")
fig.update_layout(title="Wind Disaster Risk Trends")
fig.show()

By visualizing year-over-year trends, we observed that districts such as Tempel, Berbah, Ngemplak, and Kalasan rose from Low to Moderate risk, highlighting areas needing closer monitoring. Godean and Sleman dropped to Noise, indicating minimal or inconsistent risk, while Prambanan and Depok reduced from High to Low, showing effective mitigation. This demonstrates that wind disaster risk is dynamic, and adaptive strategies are essential to focus on rising-risk districts while keeping an eye on outliers.
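The movements described above can also be tabulated directly by comparing each district's first and last labels. A minimal sketch with synthetic labels standing in for `trend_long` (the real labels come from the DBSCAN fit):

```python
import pandas as pd

# Synthetic stand-in for trend_long: one risk label per district per year
trend_long = pd.DataFrame({
    "Kecamatan": ["Tempel", "Tempel", "Prambanan", "Prambanan", "Sleman", "Sleman"],
    "Year":      [2020, 2021, 2020, 2021, 2020, 2021],
    "RiskLevel": ["Low", "Moderate", "High", "Low", "Low", "Low"],
})

# Wide table of 2020 vs 2021 labels, then keep only districts that changed
trans = trend_long.pivot(index="Kecamatan", columns="Year", values="RiskLevel")
moved = trans[trans[2020] != trans[2021]]
print(moved)
```

Run on the real data, this yields the full transition table behind the narrative, rather than relying on reading the line chart.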

Merge Clustering¶

We merge the Sleman district shapefile with the per-year Sleman data.

In [23]:
df_gdf = kecamatan.merge(sleman, left_on='KECAMATAN', right_on='Kecamatan', how='left')

Then we calculated a total feature score for each district by summing key indicators.

In [24]:
# Calculate total feature
df_gdf['Total_Feat'] = df_gdf[features].sum(axis=1)

# Calculate percentage change in Total_Feat (2020 → 2021)
trend_total = df_gdf.pivot_table(index='Kecamatan', columns='Tahun', values='Total_Feat')
trend_total['Pct_Change'] = ((trend_total[2021]-trend_total[2020])/trend_total[2020]*100).round(1)

# Merge Pct_Change
df_gdf = df_gdf.merge(trend_total['Pct_Change'], left_on='Kecamatan', right_index=True)
df_gdf['Risk_Change'] = df_gdf.groupby('Kecamatan')['KMeans_Risk'].transform(lambda x: x.iloc[-1]!=x.iloc[0])

Multi-Year Map¶

We visualized wind disaster risk in Sleman districts for 2020–2021 using an interactive map. Districts are color-coded by K-Means and DBSCAN clusters, reflecting different risk levels per method. Users can toggle layers and hover over districts to track year-over-year changes, spot hotspots needing mitigation, and recognize areas that remain stable or resilient.

In [25]:
def plot_multi_year_highlight_map(gdf, years):
    # Colormap total features
    colormap_total = cm.LinearColormap(['green','yellow','red'], vmin=gdf['Total_Feat'].min(), vmax=gdf['Total_Feat'].max(), 
                                       caption='Total Feature Count')
    
    # Default color per method
    colors_dict = {
        'KMeans': {'Low':'#00FFFF','Moderate':'#F77F00','High':'#D62828'},
        'DBSCAN': {'Noise':'#A9A9A9','Low':'#008000','Moderate':'#FFA500','High':'#FF0000'}}
    
    # Base map
    m = folium.Map(location=[-7.7,110.4], zoom_start=11, tiles="cartodbpositron")
    
    for year in years:
        gdf_year = gdf[gdf['Tahun']==year]
        
        for method in ['KMeans','DBSCAN']:
            fg = folium.FeatureGroup(name=f'{method} {year}')
            
            # Add GeoJson
            folium.GeoJson(gdf_year,
                style_function=lambda f, method=method, colors=colors_dict[method]: {
                    'fillColor': colors.get(f['properties'].get(f"{method}_Risk"), '#808080'),  # grey fallback for unmapped labels
                    'color':'black',
                    'weight': 3 if f['properties'].get(f"{method}_Risk") in ['Low','Noise'] else 1,
                    'fillOpacity':0.7},
                tooltip=folium.GeoJsonTooltip(
                    fields=['Kecamatan', f'{method}_Risk','BanyakKejadian','PohonTumbang','JaringanListrik','Total_Feat','Pct_Change'],
                    aliases=['Kecamatan','Risk Level','Banyak Kejadian','Pohon Tumbang','Jaringan Listrik','Total Features','% Change 2020→2021'])
            ).add_to(fg)
            fg.add_to(m)
    
    # Add colormap & layer control
    colormap_total.add_to(m)
    folium.LayerControl().add_to(m)
    
    return m

map_highlight = plot_multi_year_highlight_map(df_gdf, [2020, 2021])
map_highlight
Out[25]:
(interactive Folium map with K-Means and DBSCAN risk layers for 2020 and 2021)

Risk Escalation Analysis¶

Calculate the percentage of districts classified as High risk in 2021.

In [26]:
pct_high_increase = round(
    trend_long[(trend_long['Year'] == 2021) & (trend_long['RiskLevel'] == 'High')].shape[0] 
    / trend_long['Kecamatan'].nunique() * 100, 1)

print(f"Out of {trend_long['Kecamatan'].nunique()} districts, {pct_high_increase}% were classified as High risk in 2021.")
Out of 17 districts, 0.0% were classified as High risk in 2021.

With no districts classified as High risk in 2021, the results suggest that mitigation efforts prevented new high-risk zones.
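Note that this metric counts districts that are High in 2021, whatever their 2020 label; a stricter notion of escalation compares each district's 2021 label against its 2020 one. A sketch with hypothetical labels standing in for `trend_long`:

```python
import pandas as pd

# Synthetic stand-in: risk label per district for 2020 and 2021
trend_long = pd.DataFrame({
    "Kecamatan": ["A", "A", "B", "B", "C", "C"],
    "Year":      [2020, 2021] * 3,
    "RiskLevel": ["High", "High", "Low", "High", "Low", "Low"],
})

wide = trend_long.pivot(index="Kecamatan", columns="Year", values="RiskLevel")
# A district "escalated" only if it was not High in 2020 but is High in 2021
escalated = wide[(wide[2020] != "High") & (wide[2021] == "High")]
pct = round(len(escalated) / len(wide) * 100, 1)
print(pct)  # → 33.3: only district B newly escalated
```

For the Sleman data, both definitions happen to agree, since no district was High in 2021 at all.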

Risk Distribution per Year¶

Let's create a stacked bar to see distribution of risk levels across districts for each year.

In [27]:
# Create summary of district counts per RiskLevel for each year
trend_summary = trend_long.groupby(['Year', 'RiskLevel']).size().reset_index(name='Count')

# Create stacked bar
fig = px.bar(trend_summary, x='Year', y='Count', color='RiskLevel',
             color_discrete_map={'Low':'#2A9D8F','Moderate':'#F77F00','High':'#D62828','Noise':'grey'},
             title="Wind Disaster Risk Distribution per Year", labels={'Count':'Number of Districts'})

fig.update_layout(barmode='stack', xaxis=dict(tickmode='linear'), yaxis_title='Number of Districts',
                  legend_title_text='Risk Level', template='plotly_white', font=dict(size=12))

fig.show()

The chart clearly shows that High-risk districts decreased or remained stable, while Moderate-risk districts need ongoing monitoring. Low-risk districts maintained safety, demonstrating resilience. Noise segments reveal outliers or districts with irregular patterns that require targeted investigation. This visualization provides stakeholders a year-over-year perspective to prioritize mitigation and allocate resources effectively.

Conclusion¶

The wind disaster risk analysis in Sleman Regency reveals a dynamic landscape of resilience and vulnerability. High-risk districts (“red zones”) like Cangkringan demand urgent mitigation, while moderate-risk areas act as early-warning zones requiring close monitoring. Low-risk districts remain relatively safe but must still be observed, especially where outlier patterns appear. Year-over-year trends from 2020 to 2021 show no new districts escalated to high risk, highlighting the effectiveness of ongoing mitigation. By combining K-Means and DBSCAN clustering, along with interactive maps and trend visualizations, this analysis provides actionable insights to prioritize interventions, allocate resources efficiently, and strengthen community preparedness.